Bayes Optimal Decision Rule and Types of Classification Models

Supplementary handout for my presentation in the seminar Applied Statistics at the Department of Mathematics and Natural Sciences, University of Kassel




Introduction

This document is intended to complement the presentation by

  1. providing the proofs discussed and the implementation of Gaussian Naive Bayes in R, and

  2. giving a general overview of the topics.


Topics

  • Bayes Optimal Decision Rule

  • Generative vs. discriminative classification models

  • Introduction of Naive Bayes Classifier with Gaussian distribution as an example of a generative classification model


Orientation Map


This is my own representation and reflects my learning outcomes based on the given literature and with regard to the seminar topic. It is not intended to be exhaustive. For example, there are more generative models than Naive Bayes, and the indicator loss does not necessarily lead to logistic regression in a discriminative learning context.


Key Concepts

Marginalization

For a pair of random variables \((X,Y)\), the density \(f_X(x)\) of a marginal variable \(X\) is obtained from the joint pdf \(f_{X,Y}(x,y)\) by integrating out the other variable:

\[ f_X(x)=\int f_{X,Y}(x,y) \, \mathrm{d}y. \]

Thus, if the joint density is known, the information about the individual variables can be recovered (Kroese et al., 2024, p. 427). This operation is called marginalization.

Conditional Density

Let \(X\) and \(Y\) be random variables with joint pdf \(f_{X,Y}(x,y)\), and assume \(f_X(x)>0\). Then the conditional pdf of \(Y\) given \(X=x\) is

\[ f_{Y \vert X}(y \vert x)= \frac{f_{X,Y}(x,y)}{f_X(x)} \]

(Kroese et al., 2024, p. 431).

Conditional Expectation

In the continuous case, the conditional expectation of a random variable \(Y\) given \(X=x\) is defined as

\[ \mathbb{E}[Y \vert X=x]=\int y \;f_{Y \vert X}(y \vert x) \, \mathrm{d}y \]

(Kroese et al., 2024, p. 431). Replace the integral with a sum for the discrete case. “Note that \(\mathbb{E}[Y \vert X=x]\) is a function of \(x\). The corresponding random variable is written as \(\mathbb{E}[Y \vert X]\)” (Kroese et al., 2024, p. 431).


Proofs

Tower Property


\(\mathbb{E}[\mathbb{1}_Y]=\mathbb{P}(Y)\)

\(\bullet \ (\boldsymbol{X},Y)\) with joint pdf \(f_{\boldsymbol{X},Y}(\boldsymbol{x},y)\)

\(\bullet \ \operatorname{g}:\mathbb{R}^d \to \{0, \ldots, c-1\}\)

\(\bullet \ \operatorname{Loss}(Y, \operatorname{g}(\boldsymbol{X}))=\mathbb{1}\{Y \neq \operatorname{g}(\boldsymbol{X})\}\)



\(\mathbb{E}[\mathbb{1}_Y \vert X]=\mathbb{P}(Y \vert X)\)

\(\bullet \ Y\) with condition to \(\boldsymbol{X}\): \(f_{Y \vert \boldsymbol{X}}(y \vert \boldsymbol{x})\)

\(\bullet \ \operatorname{g}:\mathbb{R}^d \to \{0, \ldots, c-1\}\)

\(\bullet \ \operatorname{Loss}(Y, \operatorname{g}(\boldsymbol{X}))=\mathbb{1}\{Y \neq \operatorname{g}(\boldsymbol{X})\}\)



Optimal Classifier


Classification via Bayes’ Rule

“Following standard practice in Bayesian context, instead of writing \(f_X(x)\) and \(f_{X \vert Y}(x \mid y)\) for the pdf of \(X\) and the conditional pdf of \(X\) given \(Y\), one simply writes \(f(x)\) and \(f(x \mid y)\). If \(Y\) is a different random variable, its pdf (at y) is thus denoted by \(f(y)\)” (Kroese et al., 2024, p. 48).

For a class-dependent probability, Bayes’ Rule is \[f(y \mid x)= \frac{f(x,y)}{f(x)}.\]

The joint pdf \(f(x,y)\) can be expressed as \(f(x \mid y) \, f(y)\). Assume \(\boldsymbol{x}\) is a feature vector. Since \(f(\boldsymbol{x})\) is constant with respect to y, it follows:

\[ f(y \mid \boldsymbol{x}) \propto f(\boldsymbol{x} \mid y) \, f(y). \]

According to Bayes’ notation, \(f(y \mid x)\) is called posterior probability. Further “\(f(x \mid y)\) is the likelihood of obtaining feature vector \(\boldsymbol{x}\) of the class \(y\) and \(f(y)\) is the prior probability of class \(y\)” (Kroese et al., 2024, p. 257). If a feature vector \(\boldsymbol{x}\) is to be assigned to a class \(\hat{y}\), classification is performed according to the Bayes Optimal Decision Rule:

\[ \hat{y}=\mathrm{arg \, max}_y \, f(y \mid \boldsymbol{x}), \]

which is the Bayesian expression of the Optimal Classifier - that is, the class maximizing the (unnormalized) posterior probability.


Naive Bayes

As shown above, Bayesian classification exploits the fact that \(f(x)\) is constant across all \(y\), so it can be omitted in the decision rule. Because the true density of \(f(y \mid \boldsymbol{x})\) is unknown, it is approximated by a function \(\mathrm{g}(y \mid \boldsymbol{x})\) from a specified class of functions \(\mathcal{G}\). The function class \(\mathcal{G}\) can then be endowed with distributional assumptions: “In the naïve Bayes method, the class of approximating functions \(\mathcal{G}\) is chosen such that \(\mathrm{g}(\boldsymbol{x} \mid y) \;=\; \mathrm{g}(x_1 \mid y)\, \mathrm{g}(x_2 \mid y)\,\cdots\, \mathrm{g}(x_p \mid y),\) that is, conditional on the label, all features are independent” (Kroese et al., 2024, p. 258). Assuming a uniform prior over the classes, the factor \(f(y)\) is constant across all \(y\) and can also be omitted:

\[ \mathrm{g}(y \mid \boldsymbol{x}) \propto \prod_{j=1}^p \mathrm{g}(x_j \mid y). \]

Assume the approximating class \(\mathcal{G}\) has a Gaussian distribution, i.e. \((\boldsymbol{X}_j \vert y) \sim \mathcal{N}(\mu_{yj}, \sigma^2)\), \(y=0, \ldots, c-1\), \(j=1, \ldots, p\). The Gaussian pdf is:

\[ f(x)=\frac{1}{\sigma \sqrt{2 \pi}} \; \exp \left({{-\frac{1}{2}\frac{(x-\mu)^2}{\sigma^2}}}\right), \]

and insert into naive Bayes factorization:

\[ \mathrm{g}(\boldsymbol{x}_j \mid y) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp{\Big(-\frac{(x_j-\mu_{yj})^2}{2\sigma^2}}\Big). \]

The scaling factor \(\frac{1}{\sqrt{2 \pi \sigma^2}}\) does not depend on \(y\) and therefore drops out in the decision rule. Thus, classification is based on comparing the unnormalized posteriors:

\[ \mathrm{g}(y \mid \boldsymbol{\theta}, \ \boldsymbol{x}) \propto \exp \left(-\frac{1}{2}\sum_{j=1}^p \frac{(x_j-\mu_{yj})^2}{\sigma^2} \right). \]

The sum is the Euclidean distance:

\[ \sum_{j=1}^p(x_j-\mu_{yj})^2= \|x-\mu_y\|^2, \]

thus:

\[ \mathrm{g}(y \mid \boldsymbol{\theta}, \ \boldsymbol{x}) \propto \exp \left(-\frac{1}{2}\frac{\|x-\mu_y\|^2}{\sigma^2} \right), \]

where \(\boldsymbol{\theta}\) contains the parameters \(\mu\) and \(\sigma\), which have to be estimated from the data. “The probability \(\mathrm{g}(y \mid \boldsymbol{\theta}, \ \boldsymbol{x})\) is maximal, when \(\|x-\mu_y\|\) is minimal. Thus, \(\hat{y}=\mathrm{arg \ min}_y\ \|\boldsymbol{x}-\boldsymbol{\mu}_y\|\) is the classifier. That is, classify \(\boldsymbol{x}\) as \(y\) when \(\boldsymbol{\mu}_y\) is closest to \(\boldsymbol{x}\) in Euclidean distance” (Kroese et al., 2024, p. 258).


Model Types

There are two fundamental modelling approaches to classification. Generative classification follows the approach already introduced, modelling the joint density \(f(\boldsymbol{x},y)\). From this joint distribution, the posterior \(f(y \mid \boldsymbol{x})\) can be derived, which enables classification. However, modelling the full joint pdf can be difficult in practice, which motivates the simplifying assumptions discussed above.

A different approach is taken by discriminative models. Instead of modelling the full joint distribution, they estimate directly what is needed for classification. This is the conditional probability \(f(y \mid \boldsymbol{x})\) or even only the decision boundary between classes.

Pros and cons of each approach

Some models listed


Application

Gaussian Naive Bayes in R



Exercises

1. Conceptual Questions

a) Clearly distinguish between probability and likelihood.

b) What is the core difference between generative and discriminative models?

2. Model Classification

Several models have been mentioned above. Assign at least one of them to either the generative or discriminative modelling approach.

3. Missing Data

Consider why generatoíve models are generally better suited to handle missing data than discriminative models.


Literature

D.P. Kroese, Z.I. Botev, T. Taimre, R. Vaisman: Data Science and Machine Learning - Mathematical and Statistical Methods. Chapman and Hall/CRC, 2024.

K. P. Murphy. Machine learning: a probabilistic perspective. MIT press, 2012.